---
title: 'Assignment #1'
output: 
  html_document:
    toc: true
    toc_float: true
    df_print: paged
    code_download: true
---

```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
```

```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)         # for graphing and data cleaning
library(tidymodels)        # for modeling
library(naniar)            # for analyzing missing values
library(vip)               # for variable importance plots
library(glmnet)            # for regularized regression, including LASSO
```

```{r theme}
theme_set(theme_minimal())
```

```{r data, cache=TRUE, message=FALSE}
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
```


## Setting Up Git and GitHub in RStudio

[Here](https://github.com/alexdenzler/STAT494_site_Denzler) is my GitHub link.


## Creating a Website

* Website Link
  + [Here](https://upbeat-hawking-b9ef26.netlify.app) is the link to my website.
  


## Machine Learning review and intro to `tidymodels`

(@) Read about the hotel booking data, `hotels`, on the Tidy Tuesday page it came from. There is also a link to an article from the original authors. The outcome we will be predicting is called `is_canceled`.

* Without doing any analysis, what are some variables you think might be predictive and why?
  + A few variables could be predictive, but `previous_cancellations` stands out: it is reasonable to assume that someone who has canceled before could cancel again. `booking_changes` could also be predictive, since a guest who makes many changes is likely to be unsure about the booking, and one of those changes could be canceling the stay. Finally, `customer_type` could be predictive, since the type of customer may make it easier or harder to cancel the stay.
  
* What are some problems that might exist with the data? You might think about how it was collected and who did the collecting.
  + One issue with the way the data was collected is that there are almost twice as many observations for the city hotel as for the resort hotel, which could introduce bias. The data for canceled bookings could also be less accurate: variables such as `adults`, `children`, and `babies` cannot be confirmed when the family or group never shows up. As a result, the data may be biased towards non-canceled bookings. The `reservation_status` variable is also largely redundant, since it states whether someone canceled their reservation, which is already given by `is_canceled`. Finally, each reservation is missing a unique identifier.
  
* If we construct a model, what type of conclusions will we be able to draw from it?
  + A model would mainly tell us which variables are most important for predicting the likelihood of cancellation. LASSO is a natural way to get at this, since it shrinks the coefficients of unhelpful predictors to zero and leaves the more important ones in the model.
  

(@) Create some exploratory plots or table summaries of the data, concentrating most on relationships with the response variable. Keep in mind the response variable is numeric, 0 or 1. You may want to make it categorical (you also may not). Be sure to also examine missing values or other interesting values.

```{r}
hotels %>% 
  ggplot(aes(x = hotel)) + 
  geom_bar(fill = "blue") + 
  facet_wrap(vars(is_canceled))
```
```{r}
hotels %>% 
  ggplot(aes(x = customer_type)) + 
  geom_bar(fill = "red") +
  facet_wrap(vars(is_canceled))
```

```{r}
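hotels %>% 
  ggplot(aes(x = previous_cancellations)) +
  geom_bar(fill = "orange") +
  # Zoom with coord_cartesian(): xlim()/ylim() drop out-of-range bars,
  # including the bars sitting on the limits themselves
  coord_cartesian(xlim = c(-0.5, 3.5), ylim = c(0, 7000)) +
  facet_wrap(vars(is_canceled))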
```


```{r}
hotels %>% 
  ggplot(aes(x = adults)) + 
  geom_bar(fill = "green") +
  # Zoom without dropping the bars at 0 and 5
  coord_cartesian(xlim = c(-0.5, 5.5)) +
  facet_wrap(vars(is_canceled))
```
```{r}
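hotels %>% 
  ggplot(aes(x = children)) + 
  geom_bar(fill = "green") +
  # Zoom in on the small counts; the tall bar at 0 is clipped, not removed
  coord_cartesian(xlim = c(-0.5, 5.5), ylim = c(0, 2500)) +
  facet_wrap(vars(is_canceled))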
```

```{r}
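hotels %>% 
  ggplot(aes(x = babies)) + 
  geom_bar(fill = "green") +
  # Zoom in on the small counts; the tall bar at 0 is clipped, not removed
  coord_cartesian(xlim = c(-0.5, 5.5), ylim = c(0, 200)) +
  facet_wrap(vars(is_canceled))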
```
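
The prompt also asks for table summaries and a look at missing values. A minimal sketch of a table summary is below; `cancel_rate_by()` is a hypothetical helper introduced here, and it assumes `is_canceled` is still the numeric 0/1 column from the raw data, so that its mean is the cancellation rate.

```{r}
library(dplyr)

# Hypothetical helper (not part of the assignment): number of bookings and
# cancellation rate within each level of one grouping variable.
# Assumes is_canceled is numeric 0/1, as in the raw hotels data.
cancel_rate_by <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarize(n = n(),
              cancel_rate = mean(is_canceled),
              .groups = "drop") %>%
    arrange(desc(cancel_rate))
}
```

For example, `cancel_rate_by(hotels, hotel)` or `cancel_rate_by(hotels, customer_type)` would give the rates behind the bar charts above. For missing values, `naniar::miss_var_summary(hotels)` lists the count and percentage of `NA`s per variable; it will not flag the literal `"NULL"` strings in `agent` and `company`, which are not true missing values.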


(@) First, we will do a couple things to get the data ready, including making the outcome a factor (needs to be that way for logistic regression), removing the year variable and some reservation status variables, and removing missing values (not NULLs but true missing values). Split the data into a training and test set, stratifying on the outcome variable, `is_canceled`. Since we have a lot of data, we’re going to split the data 50/50 between training and test. I have already `set.seed()` for you. Be sure to use `hotels_mod` in the splitting.

```{r}
hotels_mod <- hotels %>% 
  mutate(is_canceled = as.factor(is_canceled)) %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  select(-arrival_date_year,
         -reservation_status,
         -reservation_status_date) %>% 
  add_n_miss() %>% 
  filter(n_miss_all == 0) %>% 
  select(-n_miss_all)

set.seed(494)
```

```{r}
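# 50/50 split, stratified on the outcome so both halves keep
# the same share of canceled bookings
hotel_split <- initial_split(hotels_mod,
                             prop = 0.5,
                             strata = is_canceled)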
hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)
```
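
One way to check that a split kept the outcome balanced is to compare class shares in the two halves. The sketch below is an illustration, not part of the assignment; `outcome_shares()` is a hypothetical helper name.

```{r}
library(rsample)
library(dplyr)

# Hypothetical helper: share of each outcome class in the training and
# testing portions of an rsample split object.
outcome_shares <- function(split, outcome) {
  bind_rows(
    training = training(split),
    testing  = testing(split),
    .id = "set"
  ) %>%
    count(set, {{ outcome }}) %>%
    group_by(set) %>%
    mutate(share = n / sum(n)) %>%
    ungroup()
}
```

Calling `outcome_shares(hotel_split, is_canceled)` should show nearly identical shares of canceled bookings in both sets when the split is stratified on `is_canceled`.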

(@) Pre-processing

```{r}
hotel_recipe <- recipe(is_canceled ~ .,
                       data = hotel_train) %>% 
  step_mutate_at(children, babies, previous_cancellations,
                 fn = ~ifelse(. > 0, 1, 0)) %>% 
  step_mutate_at(agent, company,
                 fn = ~ifelse(. == "NULL", 1, 0)) %>% 
  step_mutate(country = fct_lump_n(country, 5)) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

hotel_recipe %>% 
  prep(hotel_train) %>% 
  juice()
```
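
To make two of the recipe steps concrete, here is a small sketch of what `fct_lump_n()` and the `ifelse(. > 0, 1, 0)` conversion do on their own. The country codes and counts below are made up for illustration and are not taken from the hotels data.

```{r}
library(forcats)

# Made-up country codes, for illustration only (not from the hotels data).
country <- factor(c("PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "ESP"))

# Keep the 2 most frequent levels; every other level becomes "Other".
lumped <- fct_lump_n(country, n = 2)
table(lumped)

# The same indicator conversion the recipe applies to the count columns:
# any positive count becomes 1, zero stays 0.
children <- c(0, 2, 0, 1)
ifelse(children > 0, 1, 0)
```

In the recipe, `fct_lump_n(country, 5)` does the same thing with the five most common countries, which keeps `step_dummy()` from creating a separate column for every rare country.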


(@) LASSO model and workflow

* We would want to use a LASSO workflow because LASSO adds a penalty that shrinks the model coefficients and sets them exactly to zero for variables that do not help predict the outcome. This lets us reduce the number of predictors in the model and focus only on the ones that matter.
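
For reference, a pure LASSO penalty on logistic regression can be written as the objective below. The notation is introduced here and is not from the assignment: $y_i \in \{0, 1\}$ is `is_canceled` for booking $i$, $x_i$ is its vector of pre-processed predictors, $n$ is the number of training rows, $p$ is the number of predictors, and $\lambda \ge 0$ is the penalty.

$$
\hat{\beta} = \underset{\beta_0,\, \beta}{\arg\min} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \left( \beta_0 + x_i^\top \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right] + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right\}
$$

The first term is the average negative log-likelihood and the second is the L1 penalty. Larger values of $\lambda$ push more of the $\beta_j$ to exactly zero, which is why the predictors are normalized first: the penalty treats all coefficients on the same scale.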

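```{r}
# The outcome is a factor, so this is a classification problem:
# logistic regression with a pure L1 (LASSO) penalty, mixture = 1.
# The penalty is left as tune() so it can be chosen by resampling.
lasso_hotel_mod <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

lasso_hotel_wf <- workflow() %>%
  add_recipe(hotel_recipe) %>%
  add_model(lasso_hotel_mod)

# A workflow cannot be fit directly while penalty = tune();
# the penalty has to be tuned and the workflow finalized before fit().
```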
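
A model specified with `penalty = tune()` needs a penalty value before it can be fit. The sketch below shows one possible way to do that with cross-validation. `tune_and_fit_lasso()` is a hypothetical helper, the choice of 5 folds, 10 penalty values, and accuracy as the selection metric are assumptions made here, and the function expects a classification workflow with a tunable penalty.

```{r}
library(tidymodels)

# Hypothetical helper (not part of the assignment): pick the LASSO penalty
# by cross-validation, then refit the finalized workflow on the training data.
# Assumes `wf` is a workflow whose model has penalty = tune() and that the
# outcome column is called is_canceled.
tune_and_fit_lasso <- function(wf, train_data, v = 5, levels = 10) {
  # Cross-validation folds, stratified on the outcome
  folds <- vfold_cv(train_data, v = v, strata = is_canceled)

  # Regular grid over the default (log10-scaled) penalty range
  penalty_grid <- grid_regular(penalty(), levels = levels)

  tuned <- tune_grid(wf, resamples = folds, grid = penalty_grid)

  # Penalty with the best cross-validated accuracy
  best_penalty <- select_best(tuned, metric = "accuracy")

  wf %>%
    finalize_workflow(best_penalty) %>%
    fit(data = train_data)
}
```

Given a workflow `wf` built from `hotel_recipe` and a `logistic_reg(penalty = tune(), mixture = 1)` model, `tune_and_fit_lasso(wf, hotel_train)` would return a fitted workflow that can be passed to `predict()` with `hotel_test`. On a training set this large the tuning step can take a while, so fewer folds or grid levels may be a reasonable starting point.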




